Criticality-driven Token Dataflow Optimizations for FPGA-based Sparse LU Factorization

ثبت نشده
چکیده

Performance of FPGA-based token dataflow architectures is often limited by the long tail distribution of parallelism in the compute paths of dataflow graphs. This is known to limit speedup of dataflow processing of Sparse LU factorization to only 3– 10× over CPUs. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths; both statically during graph pre-processing and dynamically at runtime. We statically restructure the high-fanin dataflow chains using a technique inspired by Huffman encoding where we provide faster routes for late arriving inputs as predicted through our timing models. We also perform a fanout decomposition and selective node replication in order to distribute serialization costs across multiple PEs. This static restructuring overhead is small; roughly the cost of a single iteration, and is amortized across 1000s of LU iterations at runtime. Additionally, we modify the dataflow firing rule in hardware to prefer critical nodes when multiple nodes are ready for dataflow evaluation. We compute this criticality offline through a one-time slack analysis and implement this in hardware at virtually no cost through a trivial address encoding ordered by criticality. For dataflow graphs extracted for sparse LU factorization, we demonstrate up to 2.5× (mean 1.21×) improvement when using the static preprocessing alone, a 2.4× (mean 1.17×) improvement when using only runtime optimizations alone while an overall 2.9× (mean 1.39×) improvement when both static and runtime optimizations are enabled across a range of benchmark problems.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Criticality-driven Token Dataflow Optimizations for FPGA-based Sparse LU Factorization

Performance of FPGA-based token dataflow architectures is often limited by the long tail distribution of parallelism in the compute paths of dataflow graphs. This is known to limit speedup of dataflow processing of Sparse LU factorization to only 3– 10× over CPUs. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths; both statically ...

متن کامل

Tag Management in a Reconfigurable Tagged-Token Dataflow Architecture

Combining dataflow concepts with reconfigurable computing provides a great potential to exploit the application parallelism efficiently. However, to express such parallelism cannot be a trivial task. Therefore, there is a great effort to automatically translate programs originally written in procedural languages (like C and Java) into dataflow architectures which express the parallelism in a na...

متن کامل

Out-of-Order Dataflow Scheduling for FPGA Overlays

We exploit floating-point DSPs in the Arria10 FPGA and multi-pumping feature of the M20K RAMs to build a dataflow-driven soft processor fabric for large graph workloads. In this paper, we introduce the idea of out-of-order node scheduling across a large number of local nodes (thousands) per processor by combining an efficient node tagging scheme along with leading-one detector circuits. We use ...

متن کامل

An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads

Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Su...

متن کامل

Parallel Direct Solution of Linear Equations on FPGA-Based Machines

The efficient solution of large systems of linear equations represented by sparse matrices appears in many tasks. LU factorization followed by backward and forward substitutions is widely used for this purpose. Parallel implementations of this computation-intensive process are limited primarily to supercomputers. New generations of Field-Programmable Gate Array (FPGA) technologies enable the im...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017